DATA VISUALISATION IN PYTHON

DATASOC: Richard, Bianca, Jason.

Summary

Plots

-Bar Graphs

-Histogram

-Scatter Plots

-Box Plots

-Pie Graphs

-Line Graphs

-Heat Maps

-Pairwise Plots

Functions to produce graphs

How to Export Graphs

Introduction

Libraries

matplotlib.pyplot -> plotting

pandas -> data manipulation

numpy -> working with arrays

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Importing Dataset

We will be using a dataset of the top songs on Spotify by year, with variables such as genre, bpm and song length.

Taken from Kaggle: https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year

In [2]:
data = pd.read_csv("top10s.csv", encoding = "ISO-8859-1")
data
Out[2]:
Unnamed: 0 title artist top genre year bpm nrgy dnce dB live val dur acous spch pop
0 1 Hey, Soul Sister Train neo mellow 2010 97 89 67 -4 8 80 217 19 4 83
1 2 Love The Way You Lie Eminem detroit hip hop 2010 87 93 75 -5 52 64 263 24 23 82
2 3 TiK ToK Kesha dance pop 2010 120 84 76 -3 29 71 200 10 14 80
3 4 Bad Romance Lady Gaga dance pop 2010 119 92 70 -4 8 71 295 0 4 79
4 5 Just the Way You Are Bruno Mars pop 2010 109 84 64 -5 9 43 221 2 4 78
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
598 599 Find U Again (feat. Camila Cabello) Mark Ronson dance pop 2019 104 66 61 -7 20 16 176 1 3 75
599 600 Cross Me (feat. Chance the Rapper & PnB Rock) Ed Sheeran pop 2019 95 79 75 -6 7 61 206 21 12 75
600 601 No Brainer (feat. Justin Bieber, Chance the Ra... DJ Khaled dance pop 2019 136 76 53 -5 9 65 260 7 34 70
601 602 Nothing Breaks Like a Heart (feat. Miley Cyrus) Mark Ronson dance pop 2019 114 79 60 -6 42 24 217 1 7 69
602 603 Kills You Slowly The Chainsmokers electropop 2019 150 44 70 -9 13 23 213 6 6 67

603 rows × 15 columns

Rename necessary column names

Rename "top genre" to "genre"

Rename "pop" to "popularity"

In [3]:
data.rename(columns = {"top genre" : "genre", "pop" : "popularity"}, 
            inplace = True)
In [4]:
data.head()
Out[4]:
Unnamed: 0 title artist genre year bpm nrgy dnce dB live val dur acous spch popularity
0 1 Hey, Soul Sister Train neo mellow 2010 97 89 67 -4 8 80 217 19 4 83
1 2 Love The Way You Lie Eminem detroit hip hop 2010 87 93 75 -5 52 64 263 24 23 82
2 3 TiK ToK Kesha dance pop 2010 120 84 76 -3 29 71 200 10 14 80
3 4 Bad Romance Lady Gaga dance pop 2010 119 92 70 -4 8 71 295 0 4 79
4 5 Just the Way You Are Bruno Mars pop 2010 109 84 64 -5 9 43 221 2 4 78

Bar Charts

A convenient way to compare numeric values of several groups.

Why Bar Charts?

Lets us compare quantitative variables.

Good for comparison of multiple variables.

Our Task

Compare the popularity of the top 10 artists by comparing how many top songs they have across the years.

value_counts() -> determines the frequency of each artist and sorts it in descending order.

head(10) -> takes the top 10 artists.

In [5]:
top_10 = data.artist.value_counts().head(10)
top_10
Out[5]:
Katy Perry          17
Justin Bieber       16
Rihanna             15
Maroon 5            15
Lady Gaga           14
Bruno Mars          13
Ed Sheeran          11
The Chainsmokers    11
Shawn Mendes        11
Pitbull             11
Name: artist, dtype: int64

Obtain the names of the top 10 artists (as strings) and put them in a list.

In [6]:
top_10_artists = top_10.index.tolist()
top_10_artists
Out[6]:
['Katy Perry',
 'Justin Bieber',
 'Rihanna',
 'Maroon 5',
 'Lady Gaga',
 'Bruno Mars',
 'Ed Sheeran',
 'The Chainsmokers',
 'Shawn Mendes',
 'Pitbull']

Obtain the frequencies of the top 10 artists and put them in a list.

In [7]:
top_10_artists_freq = top_10.tolist()
top_10_artists_freq
Out[7]:
[17, 16, 15, 15, 14, 13, 11, 11, 11, 11]

Plot the bar chart

In [8]:
x = range(10)
plt.bar(x, top_10_artists_freq)
plt.xticks(x, top_10_artists, rotation = "vertical")
plt.title("Top 10 Artists with the most popular songs from 2010 to 2019")
plt.ylabel("Number of top songs")
plt.show()
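
With long artist names, a horizontal bar chart can be easier to read than rotated tick labels. The following is a minimal sketch of that alternative, not part of the original notebook. It hard-codes the first three counts from the value_counts() output above so that it runs on its own, and the Agg backend line is only there so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# First three artists and counts copied from the value_counts() output above
names = ["Katy Perry", "Justin Bieber", "Rihanna"]
counts = [17, 16, 15]

fig, ax = plt.subplots()
bars = ax.barh(names, counts)   # one horizontal bar per artist
ax.invert_yaxis()               # put the first (largest) bar at the top
ax.set_xlabel("Number of top songs")
ax.set_title("Artists with the most top songs")
```

In a notebook, finishing the cell with plt.show() displays the figure.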

Histograms

A plot of frequencies or relative frequencies of values within different intervals or 'bins' that cover the range of all observed values in the sample.
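
To make the idea of bins concrete, here is a small sketch that counts values per bin with np.histogram, which returns the same kind of counts and bin edges that a histogram plot is drawn from. The BPM values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical BPM values, made up for illustration (not taken from top10s.csv)
bpm = np.array([95, 100, 104, 118, 120, 120, 124, 128, 130, 148])

# Four equal-width bins covering 80 to 160
counts, edges = np.histogram(bpm, bins=4, range=(80, 160))

print(edges)   # bin boundaries: 80, 100, 120, 140, 160
print(counts)  # number of values falling in each bin

# Relative frequencies: each count divided by the sample size
rel_freq = counts / counts.sum()
print(rel_freq)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.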

Why Histograms?

When we want a graphical summary of quantitative data.

Our task

Compare the bpm of dance pop songs in 2010 and 2018

Obtain the data for only the dance pop genre

In [9]:
dance_pop = data[data.genre == "dance pop"]
dance_pop.head()
Out[9]:
Unnamed: 0 title artist genre year bpm nrgy dnce dB live val dur acous spch popularity
2 3 TiK ToK Kesha dance pop 2010 120 84 76 -3 29 71 200 10 14 80
3 4 Bad Romance Lady Gaga dance pop 2010 119 92 70 -4 8 71 295 0 4 79
6 7 Dynamite Taio Cruz dance pop 2010 120 78 75 -4 4 82 203 0 9 77
7 8 Secrets OneRepublic dance pop 2010 148 76 52 -6 12 38 225 7 4 77
10 11 Club Can't Handle Me (feat. David Guetta) Flo Rida dance pop 2010 128 87 62 -4 6 47 235 3 3 73

Obtain the data for dance pop songs in 2010

In [10]:
dance_pop_2010 = dance_pop[dance_pop.year == 2010]
dance_pop_2010.head()
Out[10]:
Unnamed: 0 title artist genre year bpm nrgy dnce dB live val dur acous spch popularity
2 3 TiK ToK Kesha dance pop 2010 120 84 76 -3 29 71 200 10 14 80
3 4 Bad Romance Lady Gaga dance pop 2010 119 92 70 -4 8 71 295 0 4 79
6 7 Dynamite Taio Cruz dance pop 2010 120 78 75 -4 4 82 203 0 9 77
7 8 Secrets OneRepublic dance pop 2010 148 76 52 -6 12 38 225 7 4 77
10 11 Club Can't Handle Me (feat. David Guetta) Flo Rida dance pop 2010 128 87 62 -4 6 47 235 3 3 73

Obtain the data for dance pop songs in 2018

In [11]:
dance_pop_2018 = dance_pop[dance_pop.year == 2018]
dance_pop_2018.head()
Out[11]:
Unnamed: 0 title artist genre year bpm nrgy dnce dB live val dur acous spch popularity
508 509 One Kiss (with Dua Lipa) Calvin Harris dance pop 2018 124 86 79 -3 8 59 215 4 11 86
509 510 Havana (feat. Young Thug) Camila Cabello dance pop 2018 105 52 77 -4 13 39 217 18 3 85
511 512 New Rules Dua Lipa dance pop 2018 116 70 76 -6 15 61 209 0 7 84
513 514 no tears left to cry Ariana Grande dance pop 2018 122 71 70 -6 29 35 206 4 6 84
514 515 IDGAF Dua Lipa dance pop 2018 97 54 84 -6 8 51 218 4 9 84

Check how many distinct artists there were in 2010 and 2018

In [12]:
print("Number of top dance pop artists in 2010:", 
      len(set(dance_pop_2010.artist)))
print("Number of top dance pop artists in 2018:", 
      len(set(dance_pop_2018.artist)))
Number of top dance pop artists in 2010: 17
Number of top dance pop artists in 2018: 27

Find the mean and median BPM in 2010 and 2018

In [13]:
print("Mean BPM in 2010:", dance_pop_2010.bpm.mean())
print("Median BPM in 2010:", dance_pop_2010.bpm.median())
Mean BPM in 2010: 123.16129032258064
Median BPM in 2010: 125.0
In [14]:
print("Mean BPM in 2018:", dance_pop_2018.bpm.mean())
print("Median BPM in 2018:", dance_pop_2018.bpm.median())
Mean BPM in 2018: 113.63157894736842
Median BPM in 2018: 109.5

Create the Histogram for BPM of Dance Pop in 2010

In [15]:
plt.hist(dance_pop_2010.bpm, 20, edgecolor = "black")
plt.show()

Create a subplot for BPM of Dance Pop in both years

plt.subplot(nrows, ncols, index)

In [16]:
plt.subplot(2, 1, 1)
plt.title("Distribution of BPM in Dance Pop")
plt.hist(dance_pop_2010.bpm, 20, edgecolor = "black")
plt.ylabel("2010")

plt.subplot(2, 1, 2)
plt.hist(dance_pop_2018.bpm, 20, edgecolor = "black")
plt.ylabel("2018")
plt.show()

Looking at the two graphs, we notice that their x-axis ranges differ, which makes them difficult to compare.

We need to specify the range manually so that both histograms use the same bins.

In [17]:
plt.subplot(2, 1, 1)
plt.title("Distribution of BPM in Dance Pop")
plt.hist(dance_pop_2010.bpm, 20, range = (40, 180), edgecolor = "black")
plt.ylabel("2010")

plt.subplot(2, 1, 2)
plt.hist(dance_pop_2018.bpm, 20, range = (40, 180), edgecolor = "black")
plt.ylabel("2018")
plt.show()
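
Another way to keep two histograms comparable is to share the x-axis and pass the same bin edges to both. The sketch below is one possible variant, not the notebook's own method. It uses plt.subplots(..., sharex=True) and made-up BPM samples; the names bpm_a, bpm_b and bin_edges are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical BPM samples, made up for illustration
bpm_a = [120, 119, 120, 148, 128, 125, 130, 93]
bpm_b = [124, 105, 116, 122, 97, 100, 95, 170]

# One set of bin edges used by both histograms: 20 bins from 40 to 180
bin_edges = np.linspace(40, 180, 21)

# sharex=True makes both panels use the same x-axis limits
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
counts_a, _, _ = ax_top.hist(bpm_a, bins=bin_edges, edgecolor="black")
ax_top.set_ylabel("sample A")
counts_b, _, _ = ax_bottom.hist(bpm_b, bins=bin_edges, edgecolor="black")
ax_bottom.set_ylabel("sample B")
ax_top.set_title("Two samples on shared bins")
```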

Scatter Plots

A good way to visualise how two quantitative variables are related in the data.

Our Task

Find the relationship between how energetic a song is and how loud it is.

Plot energy and decibels in 2019

In [18]:
data_2019 = data[data.year == 2019]
data_2019.head()
Out[18]:
Unnamed: 0 title artist genre year bpm nrgy dnce dB live val dur acous spch popularity
572 573 Memories Maroon 5 pop 2019 91 32 76 -7 8 57 189 84 5 99
573 574 Lose You To Love Me Selena Gomez dance pop 2019 102 34 51 -9 21 9 206 58 4 97
574 575 Someone You Loved Lewis Capaldi pop 2019 110 41 50 -6 11 45 182 75 3 96
575 576 Señorita Shawn Mendes canadian pop 2019 117 54 76 -6 9 75 191 4 3 95
576 577 How Do You Sleep? Sam Smith pop 2019 111 68 48 -5 8 35 202 15 9 93
In [19]:
plt.scatter(data_2019.nrgy, data_2019.dB, 10)
plt.title("Energy and Loudness of Songs in 2019")
plt.xlabel("Energy")
plt.ylabel("Decibels")
plt.show()

We can find the correlation between Energy and Loudness of a song

In [20]:
data_2019.nrgy.corr(data_2019.dB)
Out[20]:
0.7462789396350481
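
For reference, Series.corr() computes Pearson's correlation coefficient by default. The sketch below works the same quantity out by hand on a few made-up values, to show what the number measures; the variable names and data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical energy and loudness values, made up for illustration
energy = pd.Series([32, 34, 41, 54, 68, 80])
loudness = pd.Series([-7, -9, -6, -6, -5, -4])

# Pearson's r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
dx = energy - energy.mean()
dy = loudness - loudness.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The built-in method should agree with the manual calculation
r_pandas = energy.corr(loudness)
print(r_manual, r_pandas)
```

A value close to 1 means the two variables tend to rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.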

We can make a scatterplot for each year

set() -> obtain all the years available

sorted() -> correct order

In [21]:
years_sort = sorted(set(data.year))
years_sort
Out[21]:
[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
In [22]:
for curr_year in years_sort:
    data_year = data[data.year == curr_year]
    plt.scatter(data_year.nrgy, data_year.dB, 10)
    plt.title(curr_year)
    plt.xlabel("Energy")
    plt.ylabel("Decibels")
    plt.show()

Box Plots (also known as box-and-whisker plots)

A Box Plot is the visual representation of the statistical five number summary of a given data set.

A Five Number Summary includes:

  • Minimum
  • First Quartile (25%)
  • Median (Second Quartile) (50%)
  • Third Quartile (75%)
  • Maximum
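
The five numbers can be computed directly with np.percentile. This is a small sketch on made-up scores; note that other tools may interpolate quartiles slightly differently from NumPy's default.

```python
import numpy as np

# Hypothetical scores, made up for illustration
scores = np.array([24, 32, 40, 52, 56, 62, 66, 67, 71])

# 0% = minimum, 25% = first quartile, 50% = median, 75% = third quartile, 100% = maximum
minimum, q1, median, q3, maximum = np.percentile(scores, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)

# The interquartile range (IQR) is the height of the box in a box plot
iqr = q3 - q1
print(iqr)
```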

When to use?

Box plots help visualize the distribution of quantitative values in a field. They are also valuable for comparing distributions across categories and for identifying outliers, if any exist in the dataset.

Note: different software and libraries, such as Microsoft Excel and Seaborn, may place the whisker ends and show outliers differently on box plots. Make sure you understand your software's implementation before interpreting results.
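
As one concrete case, matplotlib by default (whis=1.5) extends each whisker to the most extreme data point within 1.5 times the interquartile range of the box, and draws anything beyond that as an outlier. The sketch below inspects those values with matplotlib.cbook.boxplot_stats on made-up scores; the data is hypothetical.

```python
from matplotlib import cbook

# Hypothetical scores with one unusually large value, made up for illustration
scores = [24, 32, 40, 52, 56, 62, 66, 67, 71, 150]

# boxplot_stats returns one dictionary of statistics per dataset
stats = cbook.boxplot_stats(scores)[0]

print("box:", stats["q1"], stats["med"], stats["q3"])
print("whiskers:", stats["whislo"], stats["whishi"])
print("outliers:", list(stats["fliers"]))
```

Here the upper whisker stops at 71 rather than at the maximum, and 150 is reported as an outlier.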

In [23]:
# simple demo

value1 = [82,76,24,40,67,62,75,78,71,32,98,89,78,67,72,82,87,66,56,52]
value2=[62,5,91,25,36,32,96,95,3,90,95,32,27,55,100,15,71,11,37,21]
value3=[23,89,12,78,72,89,25,69,68,86,19,49,15,16,16,75,65,31,25,52]
value4=[59,73,70,16,81,61,88,98,10,87,29,72,16,23,72,88,78,99,75,30]

box_plot_data=[value1,value2,value3,value4]

plt.boxplot(box_plot_data,patch_artist=True,labels=['course1','course2','course3','course4'])

plt.show()

The boxplot() function takes the data to plot as its first argument. patch_artist=True fills the boxes, and labels sets the label shown for each box.

In [24]:
box=plt.boxplot(box_plot_data,vert=0,patch_artist=True,labels=['course1','course2','course3','course4'])
colors = ['cyan', 'lightblue', 'lightgreen', 'tan']

#zip: Two iterables are passed

for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)
plt.show()

Passing vert=0 to boxplot() draws the box plot horizontally. The colors list holds four colors, which are applied to the four boxes one by one with patch.set_facecolor().

Pie Graph

A pie graph expresses a part-to-whole relationship in your data.

When to use?

A pie chart is best used to show the composition of something. It works well for categorical data, since each slice can represent a different category.

In [25]:
# ex1 

data[data['genre']=='pop']['artist'].value_counts().plot.pie(figsize=(10,10),autopct='%1.1f%%')
plt.title('Share of pop songs by artist (%)')

plt.show()

#autopct enables you to display the percent value using Python string formatting
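
The percentage on each slice is that category's share of the total. The following sketch reproduces those numbers without drawing anything, using value_counts(normalize=True) and the same '%1.1f%%' format string; the artist names are placeholders made up for illustration.

```python
import pandas as pd

# Hypothetical artist column, made up for illustration
artists = pd.Series(["A", "A", "B", "C", "A", "B"])

# Share of each artist as a percentage of all rows, largest first
shares = artists.value_counts(normalize=True) * 100

# Apply the same format string that is passed to autopct
slice_labels = ["%1.1f%%" % pct for pct in shares]
print(slice_labels)
```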
In [26]:
# ex2 - with explode / your choice of colour


# Data to plot
labels = 'Python', 'C++', 'Ruby', 'Java'
sizes = [215, 130, 245, 210]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.show()

Line Graph

Line graphs are usually used to show time series data, that is, how one or more variables vary over a continuous period of time.

When to use

Line graphs are used to track changes over short and long periods of time. When smaller changes exist, line graphs are better to use than bar graphs.

Line graphs can also be used to compare changes over the same period of time for more than one group.

In [27]:
import pandas as pd
import matplotlib.pyplot as plt
   
Data = {'Year': [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010],
        'Unemployment_Rate': [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]
       }
  
df = pd.DataFrame(Data,columns=['Year','Unemployment_Rate'])
  
df
Out[27]:
Year Unemployment_Rate
0 1920 9.8
1 1930 12.0
2 1940 8.0
3 1950 7.2
4 1960 6.9
5 1970 7.0
6 1980 6.5
7 1990 6.2
8 2000 5.5
9 2010 6.3
In [28]:
plt.plot(df['Year'], df['Unemployment_Rate'], color='red', marker='o')
plt.title('Unemployment Rate Vs Year', fontsize=14)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Unemployment Rate', fontsize=14)
# plt.grid(True)
plt.show()
In [29]:
# number of top songs for each of the 10 most frequent artists

data['artist'].value_counts().head(10).plot.line(figsize=(20,10))
plt.xlabel('Artist Name')
plt.ylabel('Number of songs')
plt.title('Top 10 artists')
plt.show()
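
Since line graphs suit ordered x values such as time, a per-year summary is a natural fit for them. The sketch below groups a small made-up table by year and plots the mean popularity; the songs table and its values are hypothetical stand-ins, not rows from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the songs table, made up for illustration
songs = pd.DataFrame({
    "year":       [2010, 2010, 2011, 2011, 2011, 2012],
    "popularity": [83, 79, 70, 75, 80, 65],
})

# One value per year: the mean popularity of that year's songs
mean_pop = songs.groupby("year")["popularity"].mean()

fig, ax = plt.subplots()
ax.plot(mean_pop.index, mean_pop.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Mean popularity")
ax.set_title("Mean popularity by year")
```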

Heatmap

Visually represents data with color.

Why Heat Maps?

Lets us discover trends easily in data via color.

Good for comparison of groups.

Plot a heatmap

How popular are the songs of the top 10 artists? How important are breakout hits?

In [30]:
num = 10
top_artists = []
top_song_labels = list(f"#{i} Song" for i in range(1,1+num))
top_song_popularities = np.empty((0, num))
# Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
for artist, count in data.artist.value_counts().head(num).items():
    top_artists.append(artist)
    artist_top = data[data.artist == artist].sort_values("popularity", ascending=False).popularity.head(num).tolist()
    artist_top += [0] * (num - len(artist_top))
    top_song_popularities = np.concatenate((top_song_popularities,[artist_top]),axis=0)
In [31]:
fig, (ax1, ax2) = plt.subplots(2,figsize=(20,20))
im = ax1.imshow(top_song_popularities)

# We want to show all ticks...
ax1.axis('tight')
ax1.set(xticks=np.arange(len(top_song_labels)), xticklabels=top_song_labels,
       yticks=np.arange(len(top_artists)), yticklabels=top_artists)

# Rotate the tick labels and set their alignment.
plt.setp(ax1.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(top_artists)):
    for j in range(len(top_song_labels)):
        text = ax1.text(j, i, top_song_popularities[i, j], ha="center", va="center", color="w")

ax1.set_title("Popularity of Top 10 songs for the best artists.")
fig.tight_layout()

# Show a bar graph on the second axes to accompany the heatmap
x = range(10)
ax2.bar(x, top_10_artists_freq)
ax2.set_xticks(x)
ax2.set_xticklabels(top_10_artists, rotation = "vertical")
ax2.set_title("Top 10 Artists with the most popular songs from 2010 to 2019")
ax2.set_ylabel("Number of top songs")
plt.show()
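
A heatmap is easier to read with a colorbar that maps colors back to values. Here is a minimal, self-contained sketch of imshow() with a colorbar on a made-up 3 x 3 grid; the artist names and scores are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical popularity scores: rows are artists, columns are song ranks
scores = np.array([[90, 80, 70],
                   [85, 60, 0],
                   [75, 74, 73]])
artists = ["Artist A", "Artist B", "Artist C"]
ranks = ["#1 Song", "#2 Song", "#3 Song"]

fig, ax = plt.subplots()
im = ax.imshow(scores)  # each cell is colored by its value

# Label every row and column
ax.set_xticks(np.arange(len(ranks)))
ax.set_xticklabels(ranks)
ax.set_yticks(np.arange(len(artists)))
ax.set_yticklabels(artists)

# The colorbar shows which color corresponds to which popularity value
cbar = fig.colorbar(im, ax=ax)
cbar.set_label("Popularity")
```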

Pairplots

Plots joint distributions as a scatterplot.

Plots marginal distribution along the diagonal as histograms.

Useful for looking at correlation and relationships across our variables.
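
A numeric companion to the pairplot is the correlation matrix of the numeric columns. The sketch below is an addition for illustration: it uses DataFrame.select_dtypes() and DataFrame.corr() on values copied from the first four rows shown earlier, with the titles replaced by placeholders.

```python
import pandas as pd

# bpm, nrgy and dB copied from the first four rows shown earlier; titles are placeholders
songs = pd.DataFrame({
    "title": ["s1", "s2", "s3", "s4"],
    "bpm":   [97, 87, 120, 119],
    "nrgy":  [89, 93, 84, 92],
    "dB":    [-4, -5, -3, -4],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
numeric = songs.select_dtypes(include="number")
corr_matrix = numeric.corr()
print(corr_matrix.round(2))
```

With only four rows the numbers themselves mean little; the point is the shape of the result, one row and one column per numeric variable.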

In [32]:
import seaborn as sns
sns.set_style("ticks")
sns.pairplot(data)
plt.show()

We can change how the distributions on the diagonal are drawn.

In [33]:
sns.pairplot(data, diag_kind = "kde")  # kernel density estimate
plt.show()

We can choose specific variables for the pairplot.

In [34]:
sns.pairplot(data, vars = ["popularity", "bpm"])
plt.show()

Defining functions for mass producing plots

In [35]:
def mass_plot(data, x_var, y_var, plot_type):
    years_sort = sorted(set(data.year))
    for curr_year in years_sort:
        data_year = data[data.year == curr_year]
        if plot_type == "bar":
            plt.bar(data_year[x_var], data_year[y_var], width = 10)
        elif plot_type == "scatter":
            plt.scatter(data_year[x_var], data_year[y_var], s = 10)
        elif plot_type == "line":
            # plt.plot() has no size argument; a stray 10 would be drawn as an extra data point
            plt.plot(data_year[x_var], data_year[y_var])
        
        plt.title(curr_year)
        plt.xlabel(x_var)
        plt.ylabel(y_var)
        plt.show()
In [36]:
plot_type = "line"
x_var = "bpm"
y_var = "popularity"
mass_plot(data, x_var, y_var, plot_type)
In [37]:
plot_type = "scatter"
x_var = "nrgy"
y_var = "dB"
mass_plot(data, x_var, y_var, plot_type)
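
One possible refinement of this idea is sketched below: a hypothetical function plot_by_year() that rejects unknown plot types, sorts the data before drawing a line, and returns the figures instead of showing them. It is a variant under those assumptions, not the notebook's own function, and the songs table is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_by_year(df, x_var, y_var, plot_type):
    """Draw one figure per year and return the figures (hypothetical variant of mass_plot)."""
    if plot_type not in ("scatter", "line"):
        raise ValueError(f"unknown plot_type: {plot_type!r}")
    figures = []
    for year, group in df.groupby("year"):
        fig, ax = plt.subplots()
        if plot_type == "scatter":
            ax.scatter(group[x_var], group[y_var], s=10)
        else:
            # Sort by x so the line runs left to right instead of zigzagging
            ordered = group.sort_values(x_var)
            ax.plot(ordered[x_var], ordered[y_var])
        ax.set_title(str(year))
        ax.set_xlabel(x_var)
        ax.set_ylabel(y_var)
        figures.append(fig)
    return figures

# Hypothetical stand-in for the songs table, made up for illustration
songs = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "nrgy": [89, 93, 84, 92],
    "dB":   [-4, -5, -3, -4],
})
figs = plot_by_year(songs, "nrgy", "dB", "scatter")
print(len(figs))  # one figure per year
```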

Exporting Graphs

We can export our graphs as images with plt.savefig()

In [38]:
for curr_year in years_sort:
    data_year = data[data.year == curr_year]
    plt.scatter(data_year.nrgy, data_year.dB, 10)
    plt.title(curr_year)
    plt.xlabel("Energy")
    plt.ylabel("Decibels")
    
    plt.savefig(str(curr_year), dpi = 200)  # dpi = dots per inch
    plt.clf()  # clears the current plot
<Figure size 432x288 with 0 Axes>

This is the same scatterplot code for the relationship between the energy and loudness of a song that we used before. Any plot can be exported as an image by replacing plt.show() with plt.savefig().

This specific code exports each scatterplot from 2010 to 2019 as an image.

dpi stands for dots per inch; changing it changes the resolution of the image.
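
The same export can be written with an explicit figure object. The sketch below saves one made-up scatterplot into a temporary folder with Figure.savefig() and then closes the figure; the file name, folder and data points are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Hypothetical output location, chosen so the sketch does not clutter the working folder
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "energy_vs_loudness.png")

fig, ax = plt.subplots()
ax.scatter([32, 54, 68], [-7, -6, -5], s=10)  # made-up points
ax.set_xlabel("Energy")
ax.set_ylabel("Decibels")

# The file extension picks the image format; bbox_inches="tight" trims empty margins
fig.savefig(out_path, dpi=200, bbox_inches="tight")
plt.close(fig)  # free the figure once it has been written

print(out_path)
```

Closing each figure after saving matters most in loops, where many open figures would otherwise accumulate in memory.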